Masters

Sprint 1 Report: Hierarchical Multi-Label Architecture

This report details the implementation, refactoring, and testing process taken during Month 1 to map abstract international privacy standards into verifiable, quantifiable metrics using a structured Bayesian risk engine.

1. Config & Data Foundation

Hierarchical Configuration (privacy_indicators.json)

We pivoted from a flat metric structure to a Hierarchical Multi-Label Classification mapping. This enables fine-grained attributes to simultaneously map to overlapping frameworks (e.g., GDPR and NIST) by grouping them under universal categories.

{
	"categories": {
		"Data_Minimization": {
			"attributes": {
				"collection_necessity_score": {
					"frameworks": ["GDPR:Article_5", "NIST_Privacy_Framework:CT_DP_P"]
				}
			}
		}
	}
}

Data Loading (loader.py & download.py)

To prevent the Bayesian network from collapsing under malformed constraints, we implemented rigorous schemas using Pydantic.

class CategoryConfig(BaseModel):
    attributes: Dict[str, AttributeConfig] = Field(description="Dictionary of bottom-level attributes mapped.")

Additionally, data/download.py was automated to securely fetch and locally serialize the OPP-115 policy dataset via Hugging Face datasets, creating the offline corpus required for Month 2 fine-tuning.

2. Models Layer (privacy_bert.py)

The perceptual layer establishes a PrivacyFeatureExtractor capable of wrapping Hugging Face Transformers. The extractor tokenizes incoming privacy policies and prepares the contextual output tensors required by the risk engine.

from models.Privacy_Bert.privacy_bert import PrivacyFeatureExtractor
extractor = PrivacyFeatureExtractor()
features = extractor.extract_features("We collect your data to improve services.")
# Tensor Geometry: [Batch Size, Hidden Dimensions]

3. Risk Engine (bayesian_scorer.py)

The reasoning layer replaces arbitrary risk matrices with strict probabilistic inference using pgmpy. The BayesianRiskEngine dynamically reads the hierarchical configuration and constructs a Directed Acyclic Graph (DAG) on the fly mapping Attribute -> Category -> Framework Risk.

def build_topology_from_config(self, config: RootConfig):
    # Dynamically maps bottom-level features to higher-level frameworks
    for category_name, category_data in config.categories.items():
        for attribute_name, attribute_config in category_data.attributes.items():
            edges.append((attribute_name, category_name))
            for framework_clause in attribute_config.frameworks:
                framework_name = framework_clause.split(":")[0]
                edges.append((category_name, f"{framework_name}_Risk"))

4. Testing Process & Results (test_pipeline.py)

We implemented testing paradigms targeting the structural integrity of the DAG generation and the strictness of the Pydantic JSON boundaries. The testing pipeline successfully forces the system to translate complex JSON hierarchies into active probabilistic nodes.

End-to-End Test Results (pytest output):

=============== test session starts ===============
platform win32 -- Python 3.12.8, pytest-8.3.4, pluggy-1.5.0
collected 5 items

tests/test_pipeline.py::TestSyntheticRiskPipeline::test_config_validation PASSED                                            [ 20%]
tests/test_pipeline.py::TestSyntheticRiskPipeline::test_end_to_end_scoring_consistency PASSED                               [ 40%]
tests/test_pipeline.py::TestSyntheticRiskPipeline::test_bayesian_uncertainty_bounds PASSED                                  [ 60%]
tests/test_pipeline.py::TestSyntheticRiskPipeline::test_dynamic_topology_generation PASSED                                  [ 80%]
tests/test_pipeline.py::TestSyntheticRiskPipeline::test_privacy_bert_feature_extraction PASSED                              [100%]

================ 5 passed, 2 warnings in 5.42s ================